Abstract

This paper deals with the problem of classifying a pattern based on 
multiple observations made in a time-varying environment. The identity of 
the pattern may itself change. A Bayesian solution is derived, after which 
the conditions of the physical situation are invoked to produce a "Cascade" 
classifier model. Experimental results based on remote sensing data demon- 
strate the effectiveness of the classifier. 

Key words: pattern classification, multitemporal observations, remote

BAYESIAN CLASSIFICATION IN A TIME-VARYING ENVIRONMENT

Philip H. Swain¹

Introduction 

We pose the following pattern classification problem: 

A series of observations is made on a pattern in a time- 
varying environment. The identity of the pattern itself may 
change. It is desired to classify the pattern after the current 
observation is made, drawing on information derived from ear- 
lier observations plus knowledge about the statistical behavior 
of the environment. 

An example of such a situation arises in remote sensing ap- 
plications in which the sensor system can make multiple passes 
over the same ground area [1]. The identity of the ground cover
may change between passes. In general it is desired to determine 
the current identity of the ground cover, but past observations 
can be helpful in accomplishing the identification. 

¹ Philip H. Swain is with the School of Electrical Engineering and
the Laboratory for Applications of Remote Sensing, Purdue University,
West Lafayette, IN 47907.

Approach

The classification strategy we shall develop is a Bayes
optimal (minimum risk) strategy [2]. In the ordinary single
observation case, the approach is to select a decision rule so as
to minimize the conditional average loss

$$L_X(\omega_i) = \sum_{j=1}^{m} \lambda_{ij}\, p(\omega_j \mid X) \qquad (1)$$


where

    $X$ is an n-variate observation (feature) vector
    $\{\omega_j \mid j = 1, 2, \ldots, m\}$ is the set of m classes
    $\lambda_{ij}$ is the cost resulting from classifying into
        class i a pattern actually from class j
    $p(\omega_j \mid X)$ is the conditional probability that, given
        observation X, its class is $\omega_j$.


That is, $L_X(\omega_i)$ is the expected loss incurred if an observation X
is classified as $\omega_i$. Commonly [2] $\lambda_{ij}$ is taken to be the "0-1
loss function," i.e.,

$$\lambda_{ij} = \begin{cases} 0, & i = j \quad \text{(no cost for correct classification)} \\ 1, & i \neq j \quad \text{(unit cost for an error)} \end{cases}$$

Then Eq. (1) becomes

$$L_X(\omega_i) = 1 - p(\omega_i \mid X) \qquad (2)$$

and an appropriate decision rule which will minimize $L_X(\omega_i)$ is:

Decide $X \in \omega_i$ if and only if

$$p(X \mid \omega_i)\, p(\omega_i) = \max_j\, p(X \mid \omega_j)\, p(\omega_j) \qquad (3)$$


where $p(X \mid \omega_i)$ is the probability density function for the
observations associated with class $\omega_i$ and $p(\omega_i)$ is the a priori
probability of class $\omega_i$. Thus the set of products
$\{p(X \mid \omega_i)\, p(\omega_i),\ i = 1, 2, \ldots, m\}$ is a set of
discriminant functions for the classification problem.
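
The single-observation rule of Eq. (3) can be sketched in a few lines.
This is only an illustration, not the author's implementation: the
univariate Gaussian densities, the class parameters and the function
names (`gaussian_pdf`, `bayes_decide`) are all assumptions introduced
here for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate normal density, standing in for p(X | w_i)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bayes_decide(x, class_params, priors):
    """Return the index i that maximizes p(X | w_i) p(w_i), as in Eq. (3)."""
    scores = [gaussian_pdf(x, mean, var) * prior
              for (mean, var), prior in zip(class_params, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-class problem: means 0 and 4, unit variances.
class_params = [(0.0, 1.0), (4.0, 1.0)]

# With equal priors the decision boundary sits midway between the means.
decision_low = bayes_decide(1.0, class_params, [0.5, 0.5])
decision_high = bayes_decide(3.0, class_params, [0.5, 0.5])

# Unequal priors shift the boundary toward the less probable class.
decision_equal = bayes_decide(2.2, class_params, [0.5, 0.5])
decision_skewed = bayes_decide(2.2, class_params, [0.9, 0.1])
```

The last two calls show why the a priori probabilities matter: the same
observation is assigned to different classes once one class is assumed
to be much more common.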

We now generalize this Bayes optimal approach to the case of 
a series of observations. It will be convenient to assume that ob- 
servations are made at two times. Generalization to a larger number 
of observation times is straightforward. 

Let $X_1 = X(t_1)$ and $X_2 = X(t_2)$ be n-variate random vectors,
the pattern observations at times $t_1$ and $t_2$, respectively.

Let $\{\nu_i = \nu_i(t_1) \mid i = 1, 2, \ldots, m_1\}$ be the set of possible
classes at time $t_1$, and let $\{\omega_i = \omega_i(t_2) \mid i = 1, 2, \ldots, m_2\}$
be the set of possible classes at time $t_2$.

We define a compound conditional average loss

$$L_{X_1 X_2}(\omega_i) = \sum_{j=1}^{m_2} \lambda_{ij}\, p(\omega_j \mid X_1, X_2) \qquad (4)$$

where $\lambda_{ij}$ is the cost resulting from classifying into class i, at
time $t_2$, a pattern actually from class j. In this case $p(\omega_j \mid X_1, X_2)$
is the a posteriori probability that, given the observations $X_1$
at time $t_1$ and $X_2$ at time $t_2$, the class of the pattern at time
$t_2$ is $\omega_j$.

Once again assuming a "0-1 loss function," Eq. (4) becomes

$$L_{X_1 X_2}(\omega_i) = 1 - p(\omega_i \mid X_1, X_2) \qquad (5)$$
which is minimized if we choose $\omega_i$ to maximize the a posteriori
probability $p(\omega_i \mid X_1, X_2)$. Thus an appropriate set of discriminant
functions for a Bayes optimal classification strategy is the set
of a posteriori probabilities; i.e.

$$\{\, p(\omega_i \mid X_1, X_2),\ i = 1, 2, \ldots, m_2 \,\}$$

As usual, however, we wish to derive a set of equivalent dis- 
criminant functions expressed in terms of class-conditional den- 
sity functions and a priori probabilities as in Eq. (3). This 
may be accomplished proceeding as follows. First we write: 






$$p(\omega \mid X_1, X_2) = \frac{p(\omega, X_1, X_2)}{p(X_1, X_2)} \qquad (6)$$

For fixed $X_1$ and $X_2$, the denominator in Eq. (6) is constant.
Let $c = 1/p(X_1, X_2)$ and write Eq. (6) as

$$\begin{aligned}
p(\omega \mid X_1, X_2) &= c\, p(\omega, X_1, X_2) \\
&= c \sum_{\nu} p(X_1, X_2, \nu, \omega) \\
&= c \sum_{\nu} p(X_1, X_2 \mid \nu, \omega)\, p(\nu, \omega) \\
&= c \sum_{\nu} p(X_1, X_2 \mid \nu, \omega)\, p(\omega \mid \nu)\, p(\nu) \qquad (7)
\end{aligned}$$

The summation is over the classes which can occur at time $t_1$. The
factor $p(X_1, X_2 \mid \nu, \omega)$ is a joint class-conditional density; $p(\omega \mid \nu)$
may be interpreted as a transition probability (the probability
that the class is $\omega$ at time $t_2$ given the class was $\nu$ at time $t_1$);
and $p(\nu)$ is an a priori probability.
Thus, the multiobservational decision rule analogous to Eq.
(3) is:

Decide $X_2 \in \omega_i$ if and only if

$$\sum_{k=1}^{m_1} p(X_1, X_2 \mid \nu_k, \omega_i)\, p(\omega_i \mid \nu_k)\, p(\nu_k) = \max_j \sum_{k=1}^{m_1} p(X_1, X_2 \mid \nu_k, \omega_j)\, p(\omega_j \mid \nu_k)\, p(\nu_k) \qquad (8)$$

and the set of discriminant functions is the set of sums of
products:

$$\left\{ \sum_{k=1}^{m_1} p(X_1, X_2 \mid \nu_k, \omega_i)\, p(\omega_i \mid \nu_k)\, p(\nu_k),\quad i = 1, 2, \ldots, m_2 \right\} \qquad (9)$$


A "Cascade" Implementation 

In practice, the terms in the discriminant functions must
be estimated from "training samples." The most formidable job is
estimating the $m_1 \cdot m_2$ joint class-conditional densities
$p(X_1, X_2 \mid \nu_k, \omega_i)$, each of which is of dimension 2n.² Clearly a
large number of training samples will be required. When certain
approximations can be justified, the situation is eased considerably.
We shall now show that these approximations lead to a rather
attractive model for a multitemporal classifier.


² The observation vectors need not be of the same dimensionality.
If $X_1$ has $n_1$ components and $X_2$ has $n_2$ components, then
$p(X_1, X_2 \mid \nu, \omega)$ is N-variate, where $N = n_1 + n_2$.


We are accustomed to assuming class-conditional independence
in the spatial domain; i.e., given the class at a particular point,
the random variable which is the measurement vector at that point
is independent of the class or measurement vector at any other
point. Applying this same idea to multitemporal measurements at
a given point, we say that given the classes $\nu_k$ at $t_1$ and $\omega_i$ at
$t_2$, the random variables $X_1$ and $X_2$ are independent. Then we can
write

$$p(X_1, X_2 \mid \nu_k, \omega_i) = p(X_1 \mid \nu_k, \omega_i)\, p(X_2 \mid \nu_k, \omega_i) \qquad (10)$$

and furthermore

$$p(X_1 \mid \nu_k, \omega_i) = p(X_1 \mid \nu_k), \qquad p(X_2 \mid \nu_k, \omega_i) = p(X_2 \mid \omega_i) \qquad (11)$$

Imposing these conditions, it follows that

$$p(X_1, X_2 \mid \nu_k, \omega_i) = p(X_1 \mid \nu_k)\, p(X_2 \mid \omega_i)$$


The discriminant functions, Eq. (9), then become

$$\left\{ \sum_{k=1}^{m_1} p(X_1 \mid \nu_k)\, p(X_2 \mid \omega_i)\, p(\omega_i \mid \nu_k)\, p(\nu_k),\quad i = 1, 2, \ldots, m_2 \right\} \qquad (12)$$

From Eq. (12) we can model the discriminant function calculations
as indicated in Figure 1, from which we derive the term "cascade
classifier" to describe this multistage classifier.
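
A minimal sketch of the two-stage computation in Eq. (12) follows. It
is not the program used for the experiments: the univariate Gaussian
densities, the parameter values and the names `gaussian_pdf`,
`cascade_discriminants` and `argmax` are assumptions made purely for
illustration, and the same class set is assumed at both times.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate normal density used as a stand-in class-conditional density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def cascade_discriminants(x1, x2, params1, params2, transition, priors1):
    """Evaluate Eq. (12) for every class w_i at time t2.

    params1[k] = (mean, var) of p(X1 | v_k); params2[i] likewise for p(X2 | w_i).
    transition[k][i] = p(w_i | v_k); priors1[k] = p(v_k).
    """
    # First stage: p(X1 | v_k) p(v_k) for each class at time t1.
    stage1 = [gaussian_pdf(x1, mean, var) * prior
              for (mean, var), prior in zip(params1, priors1)]
    scores = []
    for i, (mean, var) in enumerate(params2):
        # Carry the first-stage values forward through the transition probabilities.
        carried = sum(stage1[k] * transition[k][i] for k in range(len(stage1)))
        # Second stage: weight by the likelihood of the current observation.
        scores.append(gaussian_pdf(x2, mean, var) * carried)
    return scores

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical two-class example.
params = [(0.0, 1.0), (4.0, 1.0)]
transition = [[0.8, 0.2], [0.2, 0.8]]
priors = [0.5, 0.5]

x1, x2 = 0.5, 2.1  # first pass clearly class 0; second pass marginally class 1
second_stage_only = argmax([gaussian_pdf(x2, m, v) for m, v in params])
cascade_choice = argmax(
    cascade_discriminants(x1, x2, params, params, transition, priors))
```

In this toy case the second observation alone would be assigned to
class 1, while the cascade, drawing on the unambiguous first
observation, assigns the pattern to class 0.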


Simulation and Experimental Results

The cascade classifier model was programmed and applied to
the analysis of a set of Landsat multispectral data. The data,
collected by the satellite on two successive passes, eighteen days
apart, over Fayette County, Illinois (see Table 1), were geometrically
registered at Purdue University's Laboratory for Applications
of Remote Sensing. The objective of the analysis was
to discriminate among the ground cover classes "corn", "soybeans",
"woods", and "other", where the last category was simply a catch-all
consisting of water, pasture, fallow and other relatively
minor ground covers. Each class was actually decomposed in the
analysis process into a union of subclasses, each having a data
distribution describable as approximately multivariate normal.³

³ All probability densities were assumed to be multivariate normal
(Gaussian), characterized by mean vector and covariance matrix.

To provide a baseline for comparison, the data from each of
the passes were first analyzed separately. The a priori probabilities
of the classes were approximated as being equal, and 557
test samples, independent of the training samples, were used to
evaluate the results. As shown in Table 1(a) and (b), the
performance of this conventional maximum likelihood classifier was
68% correct for the June 29, 1973 data, and 72% correct for the
July 17, 1973 data.

To implement the cascade analysis, it was assumed unlikely
that the ground cover would change identity over so short a time
span. Accordingly, the transition probabilities were estimated
as follows:

$$p(\omega_i \mid \nu_k) = 0.8 \quad \text{for } \omega_i = \nu_k, \qquad (13a)$$

and all other transition probabilities were set equal and such that

$$\sum_{\omega_i \neq \nu_k} p(\omega_i \mid \nu_k) = 0.2. \qquad (13b)$$

Again the a priori probabilities were assumed equal and the same
test samples were used to evaluate the results.
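
The transition probabilities of Eqs. (13a) and (13b) can be tabulated
as a matrix. The sketch below assumes the same set of classes at both
times; the function name `make_transition_matrix` and its `stay_prob`
parameter are hypothetical names introduced here, and how the original
analysis handled subclasses is not stated.

```python
def make_transition_matrix(num_classes, stay_prob=0.8):
    """Build rows p(w_i | v_k): stay_prob on the diagonal (Eq. 13a), with the
    remaining probability split equally over the other classes (Eq. 13b)."""
    if num_classes < 2:
        raise ValueError("need at least two classes to share the remaining mass")
    off_diagonal = (1.0 - stay_prob) / (num_classes - 1)
    return [[stay_prob if i == k else off_diagonal for i in range(num_classes)]
            for k in range(num_classes)]

# Four classes, as in the Fayette County analysis (corn, soybeans, woods, other).
transition = make_transition_matrix(4)
row_sums = [sum(row) for row in transition]
```

Each row sums to one, so each row is a valid conditional distribution
over the classes at the second time.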

The results of this multitemporal classification, Table 1(c),
were substantially better than either of the unitemporal analyses.
The overall results were 84% correct. In addition, the performance
for each class was better than the best attained for the class
in either of the unitemporal analyses. The unitemporal and
multitemporal results are compared in Figure 2.
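
The percentages quoted for the Fayette County data follow directly
from the classification counts in Table 1. The sketch below recomputes
them; the counts are copied from Table 1 (rows are true classes,
columns are assigned classes, in the order corn, others, soybean,
woods), while the function names are illustrative assumptions.

```python
def overall_accuracy(confusion):
    """Percent of all test samples that fall on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

def per_class_accuracy(confusion):
    """Percent correct for each true class (row)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]

june = [[121, 36, 24, 5], [33, 40, 22, 5], [10, 30, 187, 0], [0, 4, 8, 32]]
july = [[166, 16, 1, 3], [38, 45, 15, 2], [24, 36, 167, 0], [4, 9, 6, 25]]
cascade = [[168, 11, 4, 3], [29, 48, 20, 3], [3, 10, 214, 0], [0, 5, 2, 37]]

overall = [round(overall_accuracy(m), 1) for m in (june, july, cascade)]
# overall == [68.2, 72.4, 83.8]

# Best unitemporal result for each class, to compare against the cascade.
best_single = [max(a, b) for a, b in zip(per_class_accuracy(june),
                                         per_class_accuracy(july))]
cascade_better = [c > b for c, b in zip(per_class_accuracy(cascade), best_single)]
```

Every entry of `cascade_better` is true for these counts, which is the
class-by-class improvement described in the text.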

The results can be sensitive, however, to the specification 
of the transition probabilities and a priori probabilities. This 
is demonstrated in the following experiment. 

Landsat data from two passes over Grant County, Kansas, were 
analyzed in a manner similar to that used for the Fayette County 
data. In this case, the two passes were separated by more than 
two months and a different set of classes was involved (Table 2). 

The transition probabilities were specified as in Eq. (13a) and 
(13b); equal a priori probabilities were assumed. 

As shown in Table 2 and Figure 3, in this case the overall performance
of the multitemporal cascade classifier was only marginally
better than the best unitemporal result. A closer look at the
class-by-class results is revealing. The largest detractors from
the multitemporal results were the classes "alfalfa" and "pasture."
In both of these cases, the unitemporal results for the second
pass were substantially lower than those obtained in the first
pass. (There are physical explanations for why this is reasonable,
but this is not germane to our exploration of classifier behavior.)
Let us examine the impact that the relatively arbitrary
assignment of transition probabilities has on the classification
results. In case the actual transition probabilities are not
known (which was true for the cited examples), the assignment
can be made anywhere between two extremes. On the one hand, it
could be assumed that

$$p(\omega_i \mid \nu_k) = \frac{1}{m_2}, \qquad k = 1, 2, \ldots, m_1$$

i.e., equiprobable transitions. Then the discriminant functions
have the form

$$\begin{aligned}
\sum_{k=1}^{m_1} p(X_1 \mid \nu_k)\, p(X_2 \mid \omega_i)\, \frac{1}{m_2}\, p(\nu_k)
&= \frac{1}{m_2}\, p(X_2 \mid \omega_i) \sum_{k=1}^{m_1} p(X_1 \mid \nu_k)\, p(\nu_k) \\
&= \frac{1}{m_2}\, p(X_2 \mid \omega_i)\, p(X_1).
\end{aligned}$$

Since $m_2$ and $p(X_1)$ will be common to each of the discriminant
functions, the decision will depend only on $p(X_2 \mid \omega_i)$ and will be
independent of the first-stage results.

On the other hand we could make $p(\omega_i \mid \nu_i) = 1$ and $p(\omega_j \mid \nu_i) = 0$,
$j \neq i$. Then the discriminant functions become

$$p(X_1 \mid \nu_i)\, p(X_2 \mid \omega_i)\, p(\nu_i).$$

Thus, in a sense, the contributions from the two stages are weighted
equally. There is no way to make the first stage input dominate the
second stage.
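
The two extremes can be checked numerically. In the sketch below the
likelihood values are invented for illustration, the class sets at the
two times are assumed identical, and `cascade_scores` and `argmax` are
hypothetical helper names; the discriminant itself is the sum of
products in Eq. (12).

```python
def cascade_scores(lik1, lik2, transition, priors1):
    """Eq. (12) with precomputed likelihoods:
    lik1[k] = p(X1 | v_k), lik2[i] = p(X2 | w_i), transition[k][i] = p(w_i | v_k)."""
    return [lik2[i] * sum(lik1[k] * transition[k][i] * priors1[k]
                          for k in range(len(lik1)))
            for i in range(len(lik2))]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

lik1 = [0.30, 0.05, 0.01]   # invented p(X1 | v_k): first pass favors class 0
lik2 = [0.10, 0.12, 0.02]   # invented p(X2 | w_i): second pass favors class 1
priors1 = [1.0 / 3.0] * 3

# Extreme 1: equiprobable transitions, p(w_i | v_k) = 1/m.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
# Extreme 2: no class changes, p(w_i | v_i) = 1.
identity = [[1.0 if i == k else 0.0 for i in range(3)] for k in range(3)]

scores_uniform = cascade_scores(lik1, lik2, uniform, priors1)
scores_identity = cascade_scores(lik1, lik2, identity, priors1)

choice_uniform = argmax(scores_uniform)    # same as argmax(lik2)
choice_identity = argmax(scores_identity)  # driven by the product lik1[i] * lik2[i]
```

With equiprobable transitions every score is the second-stage
likelihood times a common factor, so the first pass has no effect on
the decision. With identity transitions each score is the product of
the two likelihoods and the prior for the same class.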

In view of these considerations, another classification of
the Grant County data was performed. In this case, the transition
probabilities $p(\omega_i \mid \nu_i)$ were set equal to unity for the "alfalfa"
and "pasture" classes in order to give as much strength as possible
to the first stage results. Table 3 and Figure 3 show the
outcome of this classification. The confusing influence resulting
from the second stage data has been reduced.

It is interesting to compare the results obtained using the
cascade classifier to results produced by a "conventional" maximum
likelihood classifier using all of the multitemporal features
simultaneously. To perform the latter classifications, equal a priori
probabilities were assumed. The results were:

Fayette County: RQ.8 percent correct 

Grant County: 04.1 percent correct 

It is curious that neither of these results is any better than the
cascade classifier results achieved. It is possible that these
slightly poorer results represent the price paid for having to
estimate 8-dimensional statistics as opposed to 4-dimensional
statistics in the face of limited training data.
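
The training burden can be made concrete by counting the free
parameters of a multivariate normal density. The sketch assumes four
features per pass, so that stacking two passes gives eight features;
that feature count and the function name are assumptions for
illustration.

```python
def gaussian_parameter_count(dim):
    """Free parameters of one multivariate normal density:
    dim mean components plus dim * (dim + 1) / 2 distinct covariance entries."""
    return dim + dim * (dim + 1) // 2

stacked = gaussian_parameter_count(8)    # one density over both passes: 44
per_pass = gaussian_parameter_count(4)   # one density for a single pass: 14
```

Under these assumptions a class described by one stacked density needs
44 estimated parameters, against 14 for each single-pass density used
by the cascade.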

Discussion and Conclusions 

The approach we have adopted for classifying data in a nonstationary
environment was based on application of classical
statistical decision theory in a straightforward manner. However,
we used the conditions of the problem to approximate the
statistical quantities involved. This step simplified the
interdependencies of the data involved and led to a "cascade classifier"
model. In the time-varying environment, this model is seen to:

(1) Successfully incorporate the temporal information in
the classification process, resulting in improved classification
accuracy;

(2) Reduce the dimensionality of the probability functions
used and thereby make less stringent demands with respect to the
size of the training set required;

(3) Facilitate distribution of the computational load over
time.

Each time a set of observations becomes available, discriminant
functions are calculated which can be used, if desired,
to make a classification. However, the values of the discriminant
functions are also passed along and contribute to a new set
of discriminant functions calculated when the next set of observations
is obtained. Although we have demonstrated the use of the cascade
model only for the case of two stages, extension to an arbitrary
number of stages presents no difficulty.
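
One way the extension to more stages might look is sketched below. It
assumes that the independence conditions of Eqs. (10) and (11) carry
over to every pair of successive stages and that one transition matrix
applies between all stages; neither assumption is stated in the text.
The names `cascade_update` and `cascade_run` and the likelihood values
are hypothetical.

```python
def cascade_update(carried, lik, transition):
    """One stage: new[i] = lik[i] * sum_k carried[k] * transition[k][i]."""
    return [lik[i] * sum(carried[k] * transition[k][i] for k in range(len(carried)))
            for i in range(len(lik))]

def cascade_run(priors, likelihoods_per_stage, transition):
    """Process the stages in time order and return the discriminant values
    available after each stage; any of them can be used to classify."""
    carried = [p * l for p, l in zip(priors, likelihoods_per_stage[0])]
    history = [carried]
    for lik in likelihoods_per_stage[1:]:
        carried = cascade_update(carried, lik, transition)
        history.append(carried)
    return history

# Invented likelihoods for two classes observed at three times.
priors = [0.5, 0.5]
stages = [[0.3, 0.1], [0.2, 0.4], [0.5, 0.1]]
transition = [[0.8, 0.2], [0.2, 0.8]]

history = cascade_run(priors, stages, transition)
```

After the second stage the values equal the two-stage discriminants of
Eq. (12); only the current vector of discriminant values has to be
stored between passes.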

The prospective user of this approach should be aware that
a casual implementation of the likelihood computers may result
in computational difficulties of two sorts: loss of precision
and very large computation times as compared with, say, a conventional
Gaussian maximum likelihood classifier. Both of these
difficulties can be overcome or at least substantially reduced
by appropriate measures (scaling, ignoring zero terms, etc.)
in carrying out the likelihood computations.
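
As one possible form of such scaling, the sum of products can be
evaluated on log-likelihoods, factoring out the largest term before
exponentiating and skipping terms whose transition or a priori
probability is zero. This is a common numerical device offered here as
an assumption; the text does not say which measures were actually
used, and the function names and values are hypothetical.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))) obtained by factoring out the largest term."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def log_cascade_scores(log_lik1, log_lik2, transition, priors1):
    """Logarithm of the Eq. (12) discriminants, computed from log-likelihoods."""
    scores = []
    for i in range(len(log_lik2)):
        # Ignore zero terms: they contribute nothing to the sum.
        terms = [log_lik1[k] + math.log(transition[k][i] * priors1[k])
                 for k in range(len(log_lik1))
                 if transition[k][i] > 0.0 and priors1[k] > 0.0]
        if terms:
            scores.append(log_lik2[i] + log_sum_exp(terms))
        else:
            scores.append(float("-inf"))
    return scores

# Invented log-likelihoods small enough that exp() underflows to zero.
log_lik1 = [-800.0, -805.0]
log_lik2 = [-900.0, -899.0]
transition = [[0.8, 0.2], [0.2, 0.8]]
priors1 = [0.5, 0.5]

underflowed = math.exp(log_lik1[0])   # 0.0 in double precision
log_scores = log_cascade_scores(log_lik1, log_lik2, transition, priors1)
best = max(range(len(log_scores)), key=log_scores.__getitem__)
```

A direct product of these likelihoods would be zero for every class,
whereas the logarithmic form still ranks the classes.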
Table 1. Test results for classification of the Fayette County,
Illinois, data.

(a) June 29, 1973 data

           No. of    Percent    No. of Samples Classified Into
Group      Samples   Correct    CORN   OTHERS   SOYBEAN   WOODS
CORN         186       65.1      121      36       24        5
OTHERS       100       40.0       33      40       22        5
SOYBEAN      227       82.4       10      30      187        0
WOODS         44       72.7        0       4        8       32
TOTAL        557                 164     110      241       42

OVERALL PERFORMANCE = 68.2 percent correct

(b) July 17, 1973 data

           No. of    Percent    No. of Samples Classified Into
Group      Samples   Correct    CORN   OTHERS   SOYBEAN   WOODS
CORN         186       89.2      166      16        1        3
OTHERS       100       45.0       38      45       15        2
SOYBEAN      227       73.6       24      36      167        0
WOODS         44       56.8        4       9        6       25
TOTAL        557                 232     106      189       30

OVERALL PERFORMANCE = 72.4 percent correct

(c) Multitemporal results (cascade classifier)

           No. of    Percent    No. of Samples Classified Into
Group      Samples   Correct    CORN   OTHERS   SOYBEAN   WOODS
CORN         186       90.3      168      11        4        3
OTHERS       100       48.0       29      48       20        3
SOYBEAN      227       94.3        3      10      214        0
WOODS         44       84.1        0       5        2       37
TOTAL        557                 200      74      240       43

OVERALL PERFORMANCE = 83.8 percent correct

Table 2. Test results for classification of the
Grant County, Kansas, data.

(a) May 9, 1974

           No. of    Percent    No. of Samples Classified Into
Group      Samples   Correct    ALFALFA   CORN   FALLOW   PASTURE   WHEAT
ALFALFA       58       84.5        49        0       0        0        9
CORN         428       57.0         0      244     183        1        0
FALLOW       526       54.4         0      196     286       36        8
PASTURE     1513       52.6       127      148     220      796      222
WHEAT        930       82.5        97       17       0       49      767
TOTAL       3455                  273      605     689      882     1006

Overall Performance = 62.0 percent correct

(b) July 20, 1974

           No. of    Percent    No. of Samples Classified Into
Group      Samples   Correct    ALFALFA   CORN   FALLOW   PASTURE   WHEAT
ALFALFA       58        5.2         3        3       0       10       42
CORN         428       53.0        15      227     105       15       66
FALLOW       526       62.9         0      113     331        5       77
PASTURE     1513       42.4        64      329     213      641      266
WHEAT        930       76.2        22      108      33       58      709
TOTAL       3455                  104      780     682      729     1160

Overall Performance = 55.3 percent correct

(c) Multitemporal results (cascade classifier)

           No. of    Percent    Number of Samples Classified Into
Group      Samples   Correct    ALFALFA   CORN   FALLOW   PASTURE   WHEAT
ALFALFA       58       41.4        24        0       0        2       32
CORN         428       59.6         5      255     165        1        2
FALLOW       526       76.4         0      107     402        2       15
PASTURE     1513       46.3       101      205     224      701      282
WHEAT        930       88.1        77       19       0       13      821
TOTAL       3455                  207      586     791      719     1152

Table 3. Cascade classifier results for adjusted
transition probabilities (Grant County data).

           No. of    Percent    Number of Samples Classified Into
Group      Samples   Correct    ALFALFA   CORN   FALLOW   PASTURE   WHEAT
ALFALFA       58       94.8        55        0       0        0        3
CORN         428       70.3         5      301     122        0        0
FALLOW       526       68.1         0      139     358        7       22
PASTURE     1513       48.1       105      211     195      727      275
WHEAT        930       89.1        82        9       0       10      829
TOTAL       3455                  247      660     675      744     1129

Overall Performance = 65.7 percent correct






Figure 2. Test results for Fayette County data. 
Legend: July 20 data; cascade with modified weights.



